REMARKS 

In the Final Action dated February 28, 2006, claims 37-45 and 47-50 are pending and 
under consideration. Claims 37, 42, 43 and 47 are allowed. Claims 41 and 45 are rejected under 
35 U.S.C. §112, second paragraph, as indefinite. Claims 38-41, 44 and 48-50 are rejected under 
35 U.S.C. §112, first paragraph, for allegedly failing to satisfy the enablement requirement. 
Claims 38, 41, 44, and 48 are separately rejected under 35 U.S.C. §112, first paragraph, for 
allegedly failing to satisfy the written description requirement. Claims 38, 44 and 48 are rejected 
under 35 U.S.C. §102(b) as anticipated by Kiyoshi et al. (U.S. Patent No. 5,453,491). 

This Response addresses each of the Examiner's rejections. Applicants therefore 
respectfully submit that the present application is in condition for allowance. Favorable 
consideration of all pending claims is respectfully requested. 

Claims 41 and 45 are rejected under 35 U.S.C. §112, second paragraph, as indefinite 
for reciting the terms "mature form" and "soluble form". 

Initially, Applicants draw the Examiner's attention to the amendments to claims 41 
and 45. Claim 41 has been amended to depend from claims 38 and 40, instead of claims 37-40; 
and claim 45 has been amended to depend from claims 38-40, instead of claim 37. 

The Examiner contends that the metes and bounds of a "soluble form" and a "mature 
form" of the human NR4 molecule can only be determined by knowing at which particular amino 
acid residue of SEQ ID NO: 4 the designated forms start and at what residue they end. The 
Examiner argues that even if one skilled in the art could obtain some information regarding the 
domains of human NR4 based on Figure 1, it is not disclosed in the application whether the NR4 
polypeptide contains more than one extracellular domain, transmembrane domain or cytoplasmic 
domain. In addition, the Examiner contends that the term "mature" is ambiguous as to whether it 
relates to the activity of the polypeptide or the length of the polypeptide. In sum, the Examiner 
concludes that the information provided in the specification is insufficient for those skilled in the 
art to make a determination as to what constitutes a mature form, or a soluble form, of human 
NR4. 

Applicants respectfully submit that based on the instant disclosure, including Figures 
1 and 7 and the description of these figures, those skilled in the art would clearly understand that 
the NR4 polypeptide contains one extracellular domain, one transmembrane domain and one 
cytoplasmic domain. 

Further, Applicants respectfully submit that NR4 is clearly identified as a 
haemopoietin receptor in the specification. Haemopoietin receptors include the LIF receptor, IL- 
6 receptor and the G-CSF receptor. The basic structure of haemopoietin receptors was known 
prior to the present invention and is illustrated in Plate 2 and Plate 3 (Exhibit 1) of the 
"Guidebook to Cytokines and their Receptors", Nicos A. Nicola, A Sambrook and Tooze 
Publication, Editors, Oxford University Press, Oxford, New York, Tokyo 1994. Exhibit 1 shows 
that haemopoietin receptors have a single extracellular domain, transmembrane domain and 
cytoplasmic domain. 

As to the term "mature form", Applicants respectfully submit that the term represents 
a form of the NR4 polypeptide that is different in length from the newly translated NR4 
polypeptide as a result of post-translational processing. In the context of the present invention, 
the "mature form" of the NR4 protein can be glycosylated or unglycosylated, depending on the 
expression system employed in the recombinant production of the protein. 






Applicants disagree with the Examiner's conclusion that a "soluble form" and a 
"mature form" of human NR4 cannot be sufficiently defined without a specific disclosure in the 
specification of the precise starting and ending amino acid residues of the respective forms. 
Applicants respectfully submit that based on the information provided in the specification, those 
skilled in the art would be able to make a determination of the starting and ending amino acid 
residues, or at least the approximate starting and ending positions, and to readily confirm the 
accuracy of such a determination by routine experimentation. 

Applicants submit that the detailed deconstruction of the murine NR4 (IL-13 receptor 
alpha chain) in Example 6 clearly demonstrates that appropriate means were available in 
1995/1996 to one skilled in the art to determine the signal sequence and trans-membrane regions 
of a protein. Through these means, coupled with the disclosures relating to the analysis and 
comparison of the murine and human NR4 sequences, those skilled in the art would have been 
able to determine the signal peptide and transmembrane region of human NR4, at the time the 
present application was filed. 

In this connection, Applicants submit that there are a number of publications that 
were available to one skilled in the art describing how to identify the signal peptide and 
transmembrane regions of proteins, as well as a number of programs available on the web. 

For example, Gunnar von Heijne described a method for predicting the site of 
cleavage between a signal sequence and the mature protein in 1986 using a weight-matrix 
approach. See, "A new method for predicting signal sequence cleavage sites", Gunnar von 
Heijne, Nucleic Acids Research, 14(11): 4683-4690, 1986 (Exhibit 2). This was the most widely 
used, and in fact the common, method for predicting the location of the cleavage site of signal 
peptides in 1995/1996. 

A paper published in 1999 reviews the PROSITE database developed in 1998 that 
consists of biologically significant patterns and profiles formulated in such a way that, with the 
appropriate computational tools, it can help determine to which known family of proteins a new 
sequence belongs or which known domains it contains. The database was developed to enable 
the identification of the function of uncharacterized proteins translated from genomic or cDNA 
sequences. See, "The PROSITE database, its status in 1999", Hofmann et al., Nucleic Acids 
Research, 27(1): 215-219, 1999 (Exhibit 3). 

Further, Applicants submit that the transmembrane region was commonly determined 
by hydrophobicity analysis. For example, the following two papers described strategies for 
predicting transmembrane topology of prokaryotic and eukaryotic membrane proteins. 

"Membrane Protein Structure Prediction, Hydrophobicity analysis and the positive 
inside rule", Gunnar von Heijne, Journal of Molecular Biology, 225: 487-494, 1992 (abstract 
attached as Exhibit 4); and "Predicting the topology of eukaryotic membrane proteins", Sipos L. 
and von Heijne G., Eur. J. Biochem., 213(3): 1333-1340, 1993 (abstract attached as Exhibit 5). 
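
By way of illustration only, the kind of hydrophobicity analysis referred to above can be sketched in a few lines of Python: a hydropathy value is averaged over a sliding window and windows above a cut-off are flagged as candidate membrane-spanning stretches. The Kyte-Doolittle scale, the 19-residue window, the 1.6 cut-off, the function names and the toy sequence are all assumptions chosen for this sketch; none of them is taken from the specification or from Exhibits 4 and 5, which may rely on different scales and rules.

```python
# Kyte-Doolittle hydropathy values (an assumed, commonly used scale; the
# cited papers may rely on a different scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start indices of windows whose mean hydropathy exceeds threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, mean in enumerate(profile) if mean > threshold]


# Hypothetical toy sequence: polar flank, 20 hydrophobic residues, polar flank.
toy = "DKESRDKESR" + "LLIVALLIVALLIVALLIVA" + "RKDERKDERK"
hits = candidate_tm_windows(toy)
```

In this toy case the flagged windows cluster over the hydrophobic core, which is the behaviour a skilled person would look for when locating a transmembrane region.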

In addition, there is also a website for "TMpred", a program that makes a prediction of 
membrane-spanning regions and their orientation. See www.ch.embnet.org/software/TMPRED_form.html. 
The algorithm is based on the statistical analysis of TMbase, a database of naturally 
occurring transmembrane proteins. The prediction is made using a combination of several 
weight-matrices for scoring. TMbase was published in 1993 (see, K. Hofmann & W. Stoffel, 
"TMbase - A database of membrane spanning protein segments", Biol. Chem. Hoppe-Seyler, 
374: 166, 1993, Exhibit 6). 

Accordingly, Applicants respectfully submit that a "soluble form" and a "mature 
form" of human NR4 are sufficiently defined in the specification, despite the absence of a 
specific disclosure of the precise starting and ending amino acid residues of the respective forms. 
In view of the foregoing, Applicants respectfully submit that the rejection of claims 41 and 45 
under 35 U.S.C. §112, second paragraph, is overcome. Withdrawal of the rejection is 
respectfully requested. 

Claims 38-41, 44 and 48-50 are rejected under 35 U.S.C. §112, first paragraph, for 
allegedly failing to satisfy the enablement requirement. Claims 38, 41, 44, and 48 are separately 
rejected under 35 U.S.C. §112, first paragraph, for allegedly failing to satisfy the written 
description requirement. The Examiner's principal concern appears to be directed to the 
recitation of "a part or fragment of SEQ ID NO: 4" in independent claim 38. Essentially, the 
Examiner contends that the recited "part or fragment of SEQ ID NO: 4" is not defined by any 
structural or functional feature. 

Applicants respectfully submit that the specification does provide guidance for "parts" 
and "fragments" of NR4. For example, Example 6 of the specification (page 37 and Figure 1) 
defines the various domains of murine NR4 including a signal sequence, an extracellular domain, 
a transmembrane domain, and a cytoplasmic domain. Example 11 (pages 39-40) discloses that 
SEQ ID NO: 4 is the human homolog of murine NR4 with 75% similarity at the amino acid 
level; and Figure 7 aligns the human sequence with the murine sequence. In view of the 
disclosure in the specification, the term "part or fragment" of SEQ ID NO: 4, as presently recited, 
is fully supported by the specification. As such, the enablement and written description 
rejections under 35 U.S.C. §112, first paragraph, are overcome. Withdrawal of the rejections is 
respectfully requested. 



Claims 38, 44 and 48 are rejected under 35 U.S.C. §102(b) as anticipated by Kiyoshi 
et al. (U.S. Patent No. 5,453,491). The rejection is apparently made based on the Examiner's 
interpretation of the term "a part" of SEQ ID NO: 4 as reading on an amino acid, which is 
disclosed by Kiyoshi et al. 



Applicants respectfully submit that in light of the specification, those skilled in the art 
would not interpret the term "part" of SEQ ID NO: 4 to include simply an amino acid. As 
described in the specification, a derivative of an NR4 molecule, which includes parts or 
fragments of an NR4 molecule, can be a functional molecule, e.g., capable of binding to IL-13, or 
a non-functional but immunogenic molecule. See page 7, lines 1-13 of the specification. Those 
skilled in the art would not consider a single amino acid as "a part" of SEQ ID NO: 4. 
Accordingly, the §102(b) rejection based on Kiyoshi et al. is overcome and withdrawal thereof is 
respectfully requested. 



In view of the foregoing amendments and remarks, it is firmly believed that the 
subject application is in condition for allowance, which action is earnestly solicited. 



Scully, Scott, Murphy & Presser, P. C. 
400 Garden City Plaza-STE 300 
Garden City, NY 11530 
(516) 742-4343 
XZ:ab 

Encls.: Exhibits 1-6 




Xiaochun Zhu 
Registration No. 56,311 

Plate 2. Schematic view of the haemopoietin/interferon receptor family. 

Abbreviations: IL = interleukin, epo = erythropoietin, GH = growth hormone, PL = prolactin, CNTF = ciliary 
neurotrophic factor, G-CSF = granulocyte colony-stimulating factor, GM-CSF = granulocyte-macrophage colony- 
stimulating factor, COMM = common receptor chain to IL-3, IL-5, and GM-CSF, LIF = leukaemia inhibitory factor, 
IFN = interferon, R = receptor. 

(This figure was adapted from Figure 1 of Gearing and Ziegler 1993.) 

Plate 5. Shared receptor subunits may contribute to cytokine pleiotropy. 

ABSTRACT 

A new method for identifying secretory signal sequences and for predicting 
the site of cleavage between a signal sequence and the mature exported 
protein is described. The predictive accuracy is estimated to be around 
75-80% for both prokaryotic and eukaryotic proteins. 

INTRODUCTION 

The transient N-terminal signal sequence found on most secretory proteins 
serves to initiate export across the inner membrane (in prokaryotes) or the 
endoplasmic reticulum (in eukaryotes). Three structurally and, possibly, 
functionally distinct regions have been identified as the basic 
building-blocks of a secretory signal sequence: a basic N-terminal region 
(n-region), a central hydrophobic region (h-region), and a more polar 
C-terminal region (c-region) (1). The structural determinants for cleavage of 
the signal sequence from the mature protein once export is under way seem to 
reside in the n- and h-regions, with positions -3 and -1 relative to the 
cleavage site being the most important ones (2,3). Indeed, this 
"(-3,-1)-rule" has been used quite successfully to predict the most likely 
site of cleavage directly from the primary sequence (2). 

In view of the great interest in secretory proteins and the fact that 
most such proteins are known only from their DNA sequence, it is important to 
assess and, if possible, to improve upon the predictive accuracy of the 
original method. In this paper, I present a new scheme based on a 
weight-matrix approach that can be expected to give correct predictions about 
75-80% of the time when applied to new sequences (both prokaryotic and 
eukaryotic). This represents a substantial gain over the old method, which is 
shown to be around 65% and 45% accurate for eukaryotic and prokaryotic 
proteins, respectively. 

METHODS 

161 eukaryotic and 36 prokaryotic non-homologous signal sequences with known 
cleavage sites were chosen from my collection of signal sequences totalling 
at the present time some 450 eukaryotic and 80 prokaryotic entries. The 
prokaryotic sample did not include any sequences known to be cleaved by the 
lipoprotein signal peptidase (signal peptidase II) (4). 

Weight-matrices W(a,i) (see below) were calculated from the observed 
amino acid counts in each position, N(a,i) (i.e. the number of residues of 
type a in position i) with all sequences aligned from their known site of 
cleavage between positions -1 and +1, by first dividing all counts by their 
respective expected abundance in proteins in general, <N(a)> (Tables 1 & 2, 
last column), and then taking the natural logarithms of these quotients: 
W(a,i) = ln(N(a,i)/<N(a)>). To correct for the limited size of the data base, 
all zero-elements in the amino acid count matrices were put equal to one 
before the division. Zero-counts in positions -3 and -1 were treated 
differently: they were also put equal to one, but then divided by the total 
number of sequences in the sample, N, rather than the expected number of 
residues, e.g. W(a,-1) = ln(1/N) if N(a,-1) = 0. 

The most probable cleavage site was identified by scanning the sequence 
in question with the appropriate weight-matrix and summing the weights for 
each position, i.e. S(i) = W(a_{i+p}, i+p) + W(a_{i+p+1}, i+p+1) + ... + 
W(a_{i+q}, i+q), where the summation window extends from position i+p to i+q. 
The predicted cleavage site is the one with the highest S-value, S(j) = 
max{S(i); i = 1-p, ..., L-q}, where L is the length of the sequence analyzed. 
As shown below, maximum predictive accuracy was obtained for p = -12 and q = 2. 
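
The scanning step described above can be sketched as follows. This is a minimal illustration, not the author's program: the function name, the list-of-dictionaries layout of the weight-matrix, the reduced four-column toy matrix and the toy sequence are all assumptions. The defaults of 13 residues before and 2 after the cleavage site correspond to the -13 to +2 window reported in the text.

```python
import math


def best_cleavage_site(seq, weights, n_before=13, n_after=2):
    """Return (i, score) for the highest-scoring window.

    The predicted cleavage lies after the i-th residue (1-based), i.e. that
    residue occupies position -1. weights[k] maps residue -> log-weight for
    window column k; column 0 is the most N-terminal position of the window
    (an assumed layout, chosen for this sketch).
    """
    best = None
    for i in range(n_before, len(seq) - n_after + 1):
        window = seq[i - n_before:i + n_after]
        score = sum(weights[k][aa] for k, aa in enumerate(window))
        if best is None or score > best[1]:
            best = (i, score)
    return best


# Hypothetical toy matrix over a 4-letter alphabet with columns -3, -2, -1, +1:
# Ala is favoured in -3 and -1, the other two columns are neutral.
toy_alphabet = "AKLS"
neutral = {aa: 0.0 for aa in toy_alphabet}
favour_ala = {aa: math.log(4.0) if aa == "A" else math.log(0.25)
              for aa in toy_alphabet}
toy_weights = [favour_ala, neutral, favour_ala, neutral]

site, score = best_cleavage_site("KKLLLLASAKS", toy_weights,
                                 n_before=3, n_after=1)
```

With this toy input the only window with Ala in both favoured columns is the one ending at residue 9, so the sketch places the cleavage between residues 9 and 10.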

RESULTS 

The (-3,-1)-rule 

Based on previous statistics (2), acceptable cleavage sites were suggested to 
conform to the following rules: the residue in position -1 must be small, 
i.e. either Ala, Ser, Gly, Cys, Thr, or Gln; the residue in position -3 must 
not be aromatic (Phe, His, Tyr, Trp), charged (Asp, Glu, Lys, Arg), or large 
and polar (Asn, Gln). Further, it was suggested that Pro must be absent from 
positions -3 through +1. The new amino acid counts presented in Tables 1 & 2 
are based on more than twice as many sequences; nevertheless, the 
(-3,-1)-rule is seen to hold remarkably well. The only exceptions found to 
date among eukaryotic proteins are one sequence with Leu in -1, one with Pro 
in -2, and three with Pro in -1. Thus, barring sequencing errors, we must 
admit the possibility that residues other than the classical (-3,-1)-kinds 
can be used in position -1, but only when no better cleavage site is 
available in the vicinity (this is true for all five exceptions). 

Table 1. Amino acid counts for eukaryotic signal sequences 

The average composition (last column) is from Ref. (10) 
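
The (-3,-1)-rule as stated in this section can be expressed as a simple check. The sketch below is illustrative only: the function and constant names and the toy sequences are assumptions, and the residue sets are transcribed from the rule as worded above rather than from any published program.

```python
# Residues allowed in position -1 (small): Ala, Ser, Gly, Cys, Thr, Gln.
ALLOWED_AT_MINUS_1 = set("ASGCTQ")
# Residues excluded from position -3: aromatic (Phe, His, Tyr, Trp),
# charged (Asp, Glu, Lys, Arg), large and polar (Asn, Gln).
FORBIDDEN_AT_MINUS_3 = set("FHYW") | set("DEKR") | set("NQ")


def obeys_rule(seq, i):
    """True if cleavage after the i-th residue (1-based) fits the rule."""
    if i < 3 or i >= len(seq):
        return False  # residues at -3, -2, -1 and +1 must all exist
    minus3 = seq[i - 3]
    minus1 = seq[i - 1]
    window = seq[i - 3:i + 1]  # positions -3, -2, -1, +1
    return (minus1 in ALLOWED_AT_MINUS_1
            and minus3 not in FORBIDDEN_AT_MINUS_3
            and "P" not in window)


# Hypothetical toy sequences.
ok = obeys_rule("LLLLASAKS", 7)           # Ala in -3 and -1, no Pro
bad_minus1 = obeys_rule("LLLLASAKS", 8)   # Lys in -1
bad_proline = obeys_rule("LLLLAPAKS", 7)  # Pro in -2
```

A check of this kind only says whether a site is acceptable; it does not rank competing acceptable sites, which is the gap the weight-matrix method is meant to close.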

A few other points can also be made. First, the constraints on the 
prokaryotic sequences in the (-3,-1)-region seem even stronger than for the 
eukaryotic ones: only Ala, Gly, Ser and Thr have been found in -1, and only 
Ala, Gly, Leu, Ser, Thr, and Val in -3. Second, Leu is abundant in the 
prokaryotic sample up to and including position -8, but its incidence drops 
precipitously in position -7, where it is replaced by the likewise 
hydrophobic but less strongly helix-inducing residues Val and Phe. Only from 
position -6 do we find predominantly polar residues. Finally, there is a 
notable imbalance between the basic residues Arg and Lys in the c-region of 
the eukaryotic signal sequences, with 26 Arg and only 6 Lys (Arg/Lys = 4.3). 
This is in sharp contrast to the n-region where Arg/Lys = 66/72 = 0.9 and to 
proteins in general where the expected ratio is 0.6 (Table 1, last column). 

Table 2. Amino acid counts for prokaryotic signal sequences 

The average composition (last column) is from Ref. (10) 

[Table body (counts per residue type for positions -13 to +2, with expected abundance in the last column) is not legible in the scanned copy.] 

Construction of weight-matrices 

Weight-matrix methods have been used for a number of years to locate signals 
in nucleic acid sequences (see (5) for a thorough discussion). Their use for 
pattern recognition in protein sequences requires a larger data base (20 
amino acids rather than 4 bases must be scored for in each position), but is 
no different in principle. Basically, one converts the observed number of 
each kind of residue in each position in a sample of aligned "signals" into a 
measure of the probability of finding that particular kind of residue in that 
particular position - the probability weight-matrix - by a suitable 
normalization. Then, any new sequence can be scanned by a moving window 
(looking up the respective probabilities in the weight-matrix and multiplying 
them together for each position of the window) to get a measure of the fit to 
the sample used in the construction of the weight-matrix. The highest-scoring 
window-position is then taken as the prediction for the location of the 
signal, if the score is above some minimum value. 

To score for possible signal sequence function, and to locate the most 
probable cleavage site in a putative signal sequence, weight-matrices for 
prokaryotic and eukaryotic signal sequences were constructed as follows. The 
raw amino acid counts for the two samples (Tables 1 & 2) were divided by the 
expected number <N(a)> of each kind of residue given amino acid frequencies 
as in soluble proteins in general (last columns). Except for positions -3 and 
-1 relative to the cleavage site, all matrix elements with zero counts were 
normalized as 1/<N(a)>. For positions -3 and -1, where there is good reason 
both from previous statistical and experimental studies to believe that only 
a subset of all residues are allowed (2,6), the more stringent normalization 
1/N was used for the zero-count elements (where N is the total number of 
sequences in the sample). The final weight-matrix was obtained by taking the 
natural logarithms of the normalized values, thus reducing the ensuing 
probability calculations to summations rather than multiplications of the 
weight-matrix elements. 
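
The construction just described can be sketched as below. The sketch is an illustration under stated assumptions, not the original program: the function name, the argument layout, the four-letter toy alphabet, the four toy windows and the uniform expected counts are all hypothetical.

```python
import math


def build_weight_matrix(aligned, expected, strict_columns):
    """Log-weight matrix from windows aligned on the cleavage site.

    aligned        -- equal-length windows, one per signal sequence
    expected       -- residue -> expected count <N(a)> for a sample this size
    strict_columns -- column indices (those holding positions -3 and -1)
                      where a zero count is normalised as 1/N, not 1/<N(a)>
    """
    n_seqs = len(aligned)
    matrix = []
    for col in range(len(aligned[0])):
        counts = {aa: 0 for aa in expected}
        for window in aligned:
            counts[window[col]] += 1
        column = {}
        for aa, n in counts.items():
            if n > 0:
                column[aa] = math.log(n / expected[aa])
            elif col in strict_columns:
                column[aa] = math.log(1.0 / n_seqs)
            else:
                column[aa] = math.log(1.0 / expected[aa])
        matrix.append(column)
    return matrix


# Hypothetical toy sample: four windows covering positions -3, -2, -1, +1.
toy_windows = ["ALAK", "ASAS", "ALAK", "SLAK"]
toy_expected = {"A": 1.0, "K": 1.0, "L": 1.0, "S": 1.0}
toy_matrix = build_weight_matrix(toy_windows, toy_expected,
                                 strict_columns={0, 2})
```

Because the entries are logarithms, a window score is obtained by adding the looked-up entries, which is the simplification noted at the end of the paragraph above.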

Assessment of the predictive accuracy 

When the two weight-matrices were used to predict the cleavage sites in the 
samples used in their construction, virtually all sites were correctly 
identified (87% in the eukaryotic sample, 100% in the prokaryotic sample). 
However, these sequences are at an advantage relative to sequences not 
included in the matrix: when correctly aligned with the weight-matrix, all 
residues in a sequence included in the weight-matrix sample will correspond 
to a count, and a spuriously high predictive accuracy will be found. 

To avoid this problem, the eukaryotic sample was divided into 7 
subsamples, each of 23 sequences. For each subsample, the remaining 138 
sequences were used to construct a new weight-matrix, and this matrix was 
then applied to the subsample. Similarly, the prokaryotic sample was divided 
into 4 subsamples, each of 9 sequences. All subsequent calculations were 
carried out by summing the results for the subsamples. 
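
The hold-out procedure in the preceding paragraph amounts to what would now be called cross-validation. A minimal sketch of the splitting step follows; the function name and the placeholder sequence identifiers are assumptions, and the sketch simply takes consecutive blocks, since the text does not say how sequences were assigned to subsamples.

```python
def held_out_splits(samples, n_folds):
    """Yield (training, held_out) pairs; each block is held out exactly once.

    Assumes len(samples) is a multiple of n_folds, as in the text
    (161 = 7 x 23 and 36 = 4 x 9).
    """
    fold_size = len(samples) // n_folds
    for f in range(n_folds):
        start, stop = f * fold_size, (f + 1) * fold_size
        yield samples[:start] + samples[stop:], samples[start:stop]


# Placeholder identifiers stand in for the actual signal sequences.
eukaryotic = [f"euk{n}" for n in range(161)]
prokaryotic = [f"pro{n}" for n in range(36)]
euk_splits = list(held_out_splits(eukaryotic, 7))
pro_splits = list(held_out_splits(prokaryotic, 4))
```

For each pair, a matrix would be built from the training part only and then applied to the held-out part, with the per-subsample results summed afterwards.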

Weight-matrices including positions -15 to +5 were first used to 
determine the effect of ignoring residues at either end in the predictions. 
It was found that positions -13 to +2 were sufficient to obtain maximal 
predictive accuracy (for the prokaryotic sample, positions -5 to +2 were 
sufficient but the full -13 to +2 range was used nevertheless): with this 
choice, 125 out of 161 eukaryotic and 32 out of 36 prokaryotic cleavage sites 
(78% and 89%) were correctly identified with a standard deviation of about 
±10% in each case. For an additional 19 eukaryotic and 2 prokaryotic 
sequences, the correct site had the second-highest score. These values should 
be compared with the predictive accuracy of the older method (as implemented 
in a program kindly communicated by Dr. H.S. Ip, Rockefeller University). 
When this method was applied to the 121 sequences in the eukaryotic sample 
that were not included in the original statistics (2), 77/121 (64%) of the 
known cleavage sites were correctly identified, and only 17/36 (47%) of the 
prokaryotic ones were found. 

With -13 to +2 weight-matrices, the contribution to the overall success 
from individual positions was also investigated. Only positions -3 and -1 had 
any strong impact; when one or the other was left out in the calculations the 
percentage of correctly identified eukaryotic sites dropped to 61% and 53%, 
respectively (81% and 69% for the prokaryotic sample). 

As has been shown previously (1,7), residues -13 to -6 correspond to the 
h-region in the "average" eukaryotic signal sequence, residues -5 to -1 
correspond to the c-region, and residues +1 and +2 seem to be selected such 
that few alternative cleavage sites should exist in the vicinity of the 
correct one (i.e. residues -5 to +2 can be included in an extended c-region). 
Thus, it is possible to calculate the scores for the h- and c-regions 
separately by summing the contributions from positions -13 to -6 and -5 to 
+2, respectively. As shown in Fig. 1, the average h-region score for the 
eukaryotic sample increases slowly as the window is moved up to position -1 
(the known cleavage site), and then decreases. The average c-region score 
shows a more dramatic behaviour, with a pronounced peak in position -1 and 
troughs in positions -2 and +1, reflecting the match to the (-3,-1)-pattern 
and the tendency to have residues in position -2 that do not fit this pattern 
(see Tables 1 & 2). Similar curves are obtained for the prokaryotic sample 
(not shown). 

Interestingly, 35 out of the 36 erroneous predictions for the eukaryotic 
sequences fall on the N-terminal side of the correct cleavage site, mostly in 
the region -6 to -3 (30/36). About half of these result from matches with a 
higher score in the h-region but a lower one in the c-region than calculated 
for the correct site, whereas only 6 out of 36 have higher c- and lower 
h-region scores than the correct site. I have thus tried to improve the 
predictive accuracy in various ways, e.g. by multiplying the -3 and -1 
weights or the whole c-region score by an extra factor, or by allowing a 
small variation in the distance between the h- and c-regions, but have not 
been able to obtain more than marginal improvements on the order of [value 
illegible] in the overall success-rate. 

The method described here not only allows prediction of the most likely 
cleavage site in new signal sequences, it also makes it possible to 
discriminate quite efficiently between putative signal sequences and the 
N-terminal regions of cytosolic proteins. The distribution of maximum scores 
for the eukaryotic signal sequences is shown in Fig. 2, together with the 
corresponding distribution obtained for a sample of 132 40-residues long 
N-terminal regions of cytosolic eukaryotic proteins (8). Only 3/161 (2%) of 
the signal sequences have maximum scores < 3.5; conversely, only 2/132 (2%) 
of the cytosolic sequences have maximum scores > 3.5. This level of 
discrimination compares favourably with that obtained with a recently 
published signal-sequence detecting algorithm (9). 

DISCUSSION 

Using a standard weight-matrix approach easily implemented even on a 
micro-computer, it is possible to set up a prediction method that (i) 
provides a clean discrimination between signal sequences and the N-terminal 
region in cytosolic proteins, and (ii) can be expected to identify the 
correct cleavage site 75-80% of the time when applied to new sequences not 
included in the data base (both prokaryotic and eukaryotic). This represents 
a significant improvement over previous methods. 

Since the first submission of this work, another 36 eukaryotic signal 
sequences with known cleavage sites have been added to the data base. Using 
the same weight-matrix as above (Table 1), 75% of these sites were correctly 
predicted. 

ABSTRACT 

The PROSITE database (http://www.expasy.ch/sprot/prosite.html) consists of 
biologically significant patterns and profiles formulated in such a way that with 
appropriate computational tools it can help to determine to which known family of 
protein (if any) a new sequence belongs, or which known domain(s) it contains. 

BACKGROUND 

PROSITE (1,2) is a method of identifying what is the function of 
uncharacterized proteins translated from genomic or cDNA 
sequences. It consists of a database of biologically significant 
patterns and profiles formulated in such a way that with 
appropriate computational tools it can rapidly and reliably 
determine to which known family of protein (if any) the new 
sequence belongs, or which known domain(s) it contains. 

In some cases the sequence of an unknown protein is too 
distantly related to any protein of known structure to detect its 
resemblance by overall sequence alignment. However, relationships 
can be revealed by the occurrence in its sequence of a 
particular cluster of residue types, which is variously known as a 
pattern, motif, signature or fingerprint. These motifs arise 
because specific region(s) of a protein which may be important, 
for example, for their binding properties or for their enzymatic 
activity are conserved in both structure and sequence. These 
structural requirements impose very tight constraints on the 
evolution of this small but important portion(s) of a protein 
sequence. The use of protein sequence patterns or profiles to 
determine the function of proteins is becoming very rapidly one 
of the essential tools of sequence analysis. Many authors (3,4) 
have recognized this reality. Based on these observations, we 
decided in 1988 to actively pursue the development of a database 
of regular expression-like patterns, which would be used to search 
against sequences of unknown function. 
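
To make "regular expression-like patterns" concrete, the sketch below translates a pattern written in the PROSITE notation into an ordinary regular expression and searches two toy sequences with it. The notation used here (hyphen-separated elements, x for any residue, square brackets for allowed residues, curly braces for excluded residues, parentheses for repeat counts, < and > for the sequence ends) is not spelled out in this excerpt and is an assumption drawn from the PROSITE documentation; the function name and the toy sequences are likewise hypothetical, and the pattern shown is the commonly cited N-glycosylation motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")           # patterns are written with a final period
    regex = regex.replace("-", "")        # element separators
    regex = regex.replace("x", ".")       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # sequence ends
    return regex


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(glyco, "AANGSAA")    # Asn followed by Gly-Ser-Ala
miss = re.search(glyco, "AANPSAA")   # Pro directly after Asn is excluded
```

The order of the replacements matters: excluded-residue braces are rewritten before parentheses are turned into repeat-count braces, so the two uses of braces never collide.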

But, while sequence patterns are very useful, there are a number 
of protein families as well as functional or structural domains that 
cannot be detected using patterns due to their extreme sequence 
divergence. Typical examples of important functional domains, 
which are weakly conserved, are the globins, the immunoglobulin, 
and the SH2 and SH3 domains. In such domains there are only 
a few sequence positions which are well conserved. Any attempt 
to build a consensus pattern for such regions will either fail to pick 
up a significant proportion of the protein sequences that contain 
such a region (false negatives) or will pick up too many proteins 
that do not contain the region (false positives). 

The use of techniques based on profiles or weight matrices (the 
two terms are used synonymously here) allows the detection of 
such proteins or domains. A profile is a table of position-specific 
amino acid weights and gap costs. These numbers (also referred 
to as scores) are used to calculate a similarity score for any 
alignment between a profile and a sequence, or parts of a profile 
and a sequence. An alignment with a similarity score higher than 
or equal to a given cut-off value constitutes a motif occurrence. 
As with patterns, there may be several matches to a profile in one 
sequence, but multiple occurrences in the same sequences must 
be disjoint (non-overlapping) according to a specific definition 
included in the profile. Another feature that distinguishes patterns 
from profiles is that the latter are usually not confined to small 
regions with high sequence similarity. Rather they attempt to 
characterize a protein family or domain over its entire length. 

We therefore started in 1994 to complement the approach based
on patterns by gradually adding to PROSITE profile entries. The
profile structure (5,6) used in PROSITE is similar to but slightly
more general than the one introduced by Gribskov and
co-workers (7); additional parameters allow representation of other
motif descriptors, including the currently popular hidden Markov
models {< ). Profiles can be constructed by a large variety of
different techniques. The classical method developed by Gribskov
and co-workers (7) requires a multiple sequence alignment as
input and uses a symbol comparison table to convert residue
frequency distributions into weights. Most profiles included in
PROSITE are generated by this procedure applying recently
described modifications (10,11). In some cases we also applied
alternative profile construction methods including structure-based
approaches and methods involving hidden Markov modelling.
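
The conversion of residue frequencies into weights can be sketched as follows. This is a simplified reading of the classical approach, in which the weight of residue a at a column is the frequency-weighted average of its comparison-table scores against the residues observed in that column. The three-letter alphabet and the comparison table are hypothetical toy values; sequence weighting, gaps and the modifications mentioned above are not modelled.

```python
from collections import Counter
from typing import Dict, List

ALPHABET = "AVD"

# Hypothetical symmetric comparison table over a toy 3-letter alphabet.
COMPARISON = {
    ("A", "A"): 2, ("A", "V"): 1, ("A", "D"): -1,
    ("V", "V"): 3, ("V", "D"): -2,
    ("D", "D"): 4,
}


def similarity(a: str, b: str) -> int:
    """Look up a pair in either order, since the table is symmetric."""
    return COMPARISON.get((a, b), COMPARISON.get((b, a)))


def build_profile(alignment: List[str]) -> List[Dict[str, float]]:
    """Turn each alignment column into a table of residue weights."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):  # iterate over alignment columns
        freqs = Counter(column)
        weights = {
            a: sum(freqs[b] / n * similarity(a, b) for b in freqs)
            for a in ALPHABET
        }
        profile.append(weights)
    return profile


for position, weights in enumerate(build_profile(["AV", "AD", "VD", "AD"])):
    print(position, weights)
```

A residue that never occurs in a column still receives a weight, which is what lets a profile score sequences that differ from every member of the input alignment.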

LEADING CONCEPTS 

The design of PROSITE follows five leading concepts.

Completeness. For such a compilation to be helpful in the
determination of protein function, it is important that it contains
as many biologically meaningful patterns and profiles as possible.

Figure 1. Sample data from PROSITE.



High specificity. In the majority of cases we have chosen patterns
or profiles that are specific enough that they do not detect too
many unrelated sequences, yet they will detect most, if not all,
sequences that clearly belong to the set in consideration.

Documentation. Each of the entries in PROSITE is fully 
documented; the documentation includes a concise description of 
the protein family or domain that it is designed to detect as well 
as a summary of the reasons leading to the development of the 
pattern or profile. 

Periodic reviewing. It is important that each entry be periodically 
reviewed to ensure that it is still valid. 

A very tight relationship with the SWISS-PROT protein sequence
data bank (12). Updating of PROSITE and of the annotations of
the relevant SWISS-PROT entries is very often done in parallel.
Software tools based on PROSITE are used to automatically
update the feature table lines of SWISS-PROT entries relevant to
the presence and extent of specific domains.

FORMAT AND DOCUMENT FILES 

The core of the PROSITE database is composed of two ASCII
(text) files. The first file (PROSITE.DAT) is a computer-readable
file that contains all the information necessary for programs that
make use of PROSITE to scan sequence(s) for the occurrence of
the patterns and/or profiles. This file also includes, for each entry
described, statistics on the number of hits obtained while
scanning for that pattern or profile in SWISS-PROT. Cross-references
to the corresponding SWISS-PROT entries are also present
in the file. The second file (PROSITE.DOC), which we call the
textbook, contains textual information that documents each
pattern.

A sample textbook entry is shown (Fig. 1a); this particular entry
is linked to two entries in the PROSITE.DAT file: a pattern and
a profile (Fig. 1b).

Several document files are also distributed with the database: 

PROSUSER.TXT   The database user's manual
PROFILE.TXT    A detailed description of the syntax for the profiles
PROSITE.LIS    A list of PROSITE documentation entries
PROSITE.GET    A document on how to obtain a local copy of PROSITE
PROSITE.PRG    A description of programs and electronic mail servers
               that make use of PROSITE
PAimNDX.TXT    An index of authors cited in the PROSITE.DOC file

CONTENT OF THE CURRENT RELEASE 

Release 15.0 of PROSITE (July 1998) contains 1014 documentation
entries describing 1352 different patterns, rules and profiles/matrices.
In addition to these entries, a collection of 241
preliminary profiles is available in the pre-release distribution
from the FTP server of the ISREC group (see below). The list of
the documentation entries that have been added since the last
release of PROSITE (14.0) is provided in Table 1; furthermore,
many entries were updated. The database requires ~5 Mb of disk
storage space. The present distribution frequency is two releases
per year. No restrictions are placed on use or redistribution of the
data. Future releases of PROSITE will be copyright (releases up
to number 15.0 are not).

HOW TO OBTAIN A LOCAL COPY OF PROSITE 

By CD-ROM

PROSITE is distributed on CD-ROM by the EMBL Outstation,
the European Bioinformatics Institute (EBI) (13). For all
enquiries regarding the subscription and distribution of PROSITE
one should contact The EMBL Outstation, The European
Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, UK; Tel: +44 1223 494 444;
Fax: +44 1223 494 468; Email: datalib@ebi.ac.uk
Table 1. List of patterns documentation entries that have been added since
the last release of PROSITE (14.0)

DNA repair protein radC family signature
recR protein signature
ubiH/COQ6 monooxygenase family signature
ATP phosphoribosyltransferase signature
Prolipoprotein diacylglyceryl transferase signature
Phosphatidate cytidylyltransferase signature
Lipoate-protein ligase B signature
moaA / nifB / pqqE family signature
BCCT family of transporters signature
Flagellar motor protein motA family signature
Protein secA signatures
ATP1G1 / PLM / MAT8 family signature
Protein smpB signature
Uncharacterized protein family UPF0044 signature
Uncharacterized protein family UPF0047 signature
Uncharacterized protein family UPF0054 signature
Uncharacterized protein family UPF0057 signature



By anonymous FTP 

If you have access to a computer system linked to the Internet you 
can obtain PROSITE using FTP (File Transfer Protocol), from 
the following file servers: 

ExPASy (Expert Protein Analysis System) server, Swiss
Institute of Bioinformatics (SIB); Internet address:
ftp://www.expasy.ch/databases/prosite/

ISREC (Swiss Institute for Experimental Cancer Research)
anonymous FTP server, Swiss Institute of Bioinformatics (SIB);
Internet address: ftp://ftp.isrec.isb-sib.ch/sib-isrec/profiles/

EBI (European Bioinformatics Institute) anonymous FTP
server; Internet address: ftp://ftp.ebi.ac.uk/pub/databases/prosite/

The pre-release collection of profiles is only available from the
ISREC FTP server.



By Email through the EBI network fileserver

PROSITE can be obtained from the EBI network fileserver.
Detailed instructions on how to make the best use of this service,
and in particular on how to obtain PROSITE, can be obtained by
sending to the network address netserv@ebi.ac.uk the following
message:

HELP
HELP PROSITE
HOW TO MAKE USE OF PROSITE

Computer programs 

Many academic groups and commercial companies have developed
computer programs that make use of the pattern entries in
PROSITE. The 'PROSITE.PRG' file contains a full list of these
programs, their operating system specificity, characteristics as
well as information on how to obtain them.

Two software packages are distributed to make use of profile
entries:

(i) pftools (version 2.1 in FORTRAN77) written by Philipp
Bucher. pfscan loads a sequence from a file and scans it with all
(or one) of PROSITE profiles; pfsearch loads a profile from a file
and scans for it in a SWISS-PROT database file. These tools are
available by anonymous FTP from the server:
ftp://ftp.isrec.isb-sib.ch/sib-isrec/pftools . Several versions are
available, as well as executables compiled for many Unix platforms
and for Windows 95.

(ii) Prjlib (version 1.0 in ANSI C) written by Nicolas Moeri.
scan4prf loads a sequence from a file and scans it with all (or one)
of PROSITE profiles; srch4prf loads a profile from a file and
scans for it in a SWISS-PROT database file. These tools are
available from the server: http://mamac29.epfl.ch/

Email servers 

There are many Email servers that are available to molecular
biologists (14). This is an example of a server taking advantage of
the PROSITE database:

Name:                        MOTIF E-Mail Server on GenomeNet
Organization:                Supercomputer Laboratory, Kyoto Institute for
                             Chemical Research, Japan
Description:                 Allows rapid comparison of a new protein
                             sequence against all patterns stored in
                             PROSITE as well as in the MotifDic library (15).
Server email address:        motif@genome.ad.jp
Address to report problems:  motif-manager@genome.ad.jp

Interactive access to PROSITE using the World Wide Web

The most efficient and user-friendly way to browse interactively
in PROSITE as well as to analyze a sequence for the occurrence
of a pattern or a profile is to use the World-Wide Web (WWW)
molecular biology server ExPASy (16). Using a WWW browser,
one has access to all the hypertext documents stored on the
ExPASy server (as well as many other WWW servers) and also
can make use of many sequence analysis software tools.

The ExPASy server may be accessed through its URL, which is:
http://www.expasy.ch/ . You can directly access the 'top' page
of the section of ExPASy that allows you to browse through the
PROSITE documentation and data entries by opening the URL:
http://www.expasy.ch/sprot/prosite.html

To use the PROSITE patterns and profiles, you can make use
of the following software tools.

ScanProsite. Allows the user to either scan a protein sequence
(from SWISS-PROT or provided by the user) for the occurrence
of patterns stored in PROSITE or to scan the SWISS-PROT
and/or TrEMBL database (including weekly releases) for the
occurrence of a pattern that can originate from PROSITE or be
provided by the user. The URL for ScanProsite is:
http://www.expasy.ch/sprot/scnpsite.html
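
As a rough illustration of what scanning a sequence for a pattern involves, the sketch below translates a pattern written in the usual PROSITE notation into a regular expression and reports where it matches. The notation itself is not described in this section, so the conventions handled here (elements separated by '-', 'x' for any residue, '[...]' for allowed residues, '{...}' for forbidden residues, '(n)' or '(n,m)' for repeats, '<' and '>' as terminal anchors, a trailing period) are stated as assumptions. The function names are hypothetical, the test sequence is invented, and this is not the ScanProsite implementation.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern string into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_hits(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (start, matched_text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."))
print(find_hits("N-{P}-[ST]-{P}.", "MKNATGANPSA"))
```

A short pattern such as the second one matches by chance in many unrelated sequences, which is the false-positive problem that motivates high-specificity patterns and profiles.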

ProfileScan. Allows the user to scan a protein sequence (from
SWISS-PROT or provided by the user) for the occurrence of
profiles stored in PROSITE. The URL for ProfileScan is:
http://www.isrec.isb-sib.ch/software/PFSCAN_form.html

FrameProfileScan. Allows the user to scan a DNA sequence
(translated on the fly into protein), from EMBL or provided by
the user, for the occurrence of profiles stored in PROSITE. The
URL for FrameProfileScan is:
http://www.isrec.isb-sib.ch/software/PFRAMESCAN_form.html
Abstract

A new strategy for predicting the topology of bacterial inner membrane proteins is proposed
on the basis of hydrophobicity analysis, automatic generation of a set of possible topologies
and ranking of these according to the positive-inside rule. A straightforward implementation
with no attempts at optimization predicts the correct topology for 23 out of 24 inner
membrane proteins with experimentally determined topologies, and correctly identifies 135
transmembrane segments with only one overprediction.
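
The general idea of combining hydrophobicity analysis with the positive-inside rule can be sketched as below. This is a toy illustration under stated assumptions, not the published procedure: the abstract does not give the hydrophobicity scale, window length or threshold, so the Kyte-Doolittle scale, a 19-residue window and a threshold of 1.6 are substituted here, only one set of segments is considered rather than a ranked set of candidate topologies, and a tie is arbitrarily resolved as "in". The function names are hypothetical.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy values (a substituted scale, see the note above).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def find_tm_segments(seq: str, window: int = 19, threshold: float = 1.6) -> List[Tuple[int, int]]:
    """Pick the best-scoring non-overlapping windows above the threshold."""
    scores = [
        (sum(KD[r] for r in seq[i:i + window]) / window, i)
        for i in range(len(seq) - window + 1)
    ]
    segments: List[Tuple[int, int]] = []
    for avg, i in sorted(scores, reverse=True):  # most hydrophobic first
        if avg < threshold:
            break
        if all(i + window <= s or e <= i for s, e in segments):
            segments.append((i, i + window))
    return sorted(segments)


def n_terminus_side(seq: str, segments: List[Tuple[int, int]]) -> str:
    """Apply the positive-inside rule: the side richer in K and R is 'in'."""
    bounds = [0] + [p for seg in segments for p in seg] + [len(seq)]
    loops = [seq[bounds[k]:bounds[k + 1]] for k in range(0, len(bounds), 2)]
    positives = [loop.count("K") + loop.count("R") for loop in loops]
    same_side = sum(positives[0::2])   # loops on the N-terminal side
    other_side = sum(positives[1::2])  # loops on the opposite side
    return "in" if same_side >= other_side else "out"


seq = "MKRKS" + "L" * 19 + "SDEDS"
segments = find_tm_segments(seq)
print(segments, n_terminus_side(seq, segments))
```

Loops alternate sides of the membrane, so summing the lysine and arginine counts over every second loop is enough to compare the two possible orientations of a fixed set of segments.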

Author Keywords: membrane protein; protein structure; prediction 
Predicting the topology of eukaryotic membrane proteins.

Sipos L, von Heijne G.

Department of Theoretical Physics, Royal Institute of Technology, 
Stockholm, Sweden. 

We show that the so-called 'positive inside' rule, i.e. the observation that 
positively charged amino acids tend to be more prevalent in cytoplasmic 
than in extra-cytoplasmic segments in transmembrane proteins [von 
Heijne, G. (1986) EMBO J. 5, 3021-3027], seems to hold for all polar 
segments in multi-spanning eukaryotic membrane proteins irrespective 
of their position in the sequence and hence can be used in conjunction 
with hydrophobicity analysis to predict their transmembrane topology.
Further, as suggested by others, we confirm that the net charge 
difference across the first transmembrane segment correlates well with 
its orientation [Hartmann, E., Rapoport, T. A. and Lodish H. F. (1989) 
Proc. Natl Acad. Sci. USA 86, 5786-5790], and that the overall amino- 
acid composition of long polar segments can also be used to predict their 
cytoplasmic or extra-cytoplasmic location [Nakashima, H. and 
Nishikawa, K. (1992) FEBS Lett. 303, 141-146]. We present an
approach to the topology prediction problem for eukaryotic membrane 
proteins based on a combination of these methods. 
TMpred - Prediction of Transmembrane Regions and Orientation

The TMpred program makes a prediction of membrane-spanning regions and their
orientation. The algorithm is based on the statistical analysis of TMbase, a database
of naturally occurring transmembrane proteins. The prediction is made using a
combination of several weight-matrices for scoring.

K. Hofmann & W. Stoffel (1993)
TMbase - A database of membrane spanning proteins segments
Biol. Chem. Hoppe-Seyler 374, 166

For further information see the TMbase and TMpredict documentation.



Usage: Paste your sequence in one of the supported formats into the sequence
field below and press the "Run TMpred" button.

Make sure that the format button (next to the sequence field) shows the correct
format.

Choose the minimal and maximal length of the hydrophobic part of the
transmembrane helix.
MF C-35 A Database of Membrane Spanning Protein Segments
K. Hofmann and W. Stoffel

Institut für Biochemie, Medizinische Fakultät, Universität zu Köln, Köln, FRG

A database of all protein segments that are reported to span a membrane has been 
extracted from SwissProt 22. This sub-database consists of several tables that can 
be used with any relational database management system. The information stored 
within the database contains, besides the sequence itself, both annotational items
extracted from SwissProt and additional data fields calculated from the sequence or
taken from other sources. Important data fields include, for example, the putative
transmembrane sequence, the sequence of the flanking regions, taxonomic 
information, the presumed orientation of the segment, calculated values for 
hydrophobicity and hydrophobic moment, and grouping into families by either 
functional or sequence relatedness of the proteins. 

This database together with a set of related programs has been used to analyze the 
presumed transmembrane segments for positional preferences of amino acid 
residues. The influences of neighbouring residues, membrane protein classification, 
taxonomic classification and segment orientation on these positional preferences 
have been studied. 